Skip to content

Install a separate driver instance for each test#201

Merged
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
willie-yao:separate-driver-install
May 20, 2026
Merged

Install a separate driver instance for each test#201
k8s-ci-robot merged 1 commit into
kubernetes-sigs:mainfrom
willie-yao:separate-driver-install

Conversation

@willie-yao
Copy link
Copy Markdown
Contributor

Each It block now installs its own DRA driver release with a unique driver name, and uninstalls it on teardown.

Fixes #171

Changes:

  • New test/e2e/driver_test.go with installDriver(ctx, DriverConfig) helper. Shells out to helm (already required by setup-e2e.sh); honors HELM_CHART_PATH env var.
  • deployManifest accepts a DriverConfig and substitutes gpu.example.com -> per-test driver name (and example.com/gpu -> per-test extended resource name) in demo manifests before applying.
  • Beyond helm --wait, polls explicitly for DeviceClass + ResourceSlices + webhook readiness before returning from install.
  • Per-test DeferCleanup LIFO: driver-log diagnostics -> helm uninstall -> namespace delete with foreground propagation + termination wait.
  • setup-e2e.sh no longer installs the driver. Cert-manager + kind cluster + image build stay (cluster-wide infra).
  • Webhook specs share one driver pinned to gpu.example.com and run Ordered, Serial so their static testdata stays valid.

@k8s-ci-robot k8s-ci-robot requested review from byako and pohly May 13, 2026 17:42
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2026
Comment thread test/e2e/driver_test.go

// verifyWebhook waits until the validating webhook is serving for the given
// DeviceClass by creating a dry-run ResourceClaim until it succeeds.
func verifyWebhook(ctx context.Context, deviceClassName string) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is mostly moved from e2e_setup_test.go at line 107: https://github.com/willie-yao/dra-example-driver/blob/main/test/e2e/e2e_setup_test.go#L107

Changes: now takes a deviceClassNAme parameter, the test claim's Name is suffixed with it due to parallel webhook installs and added DeviceClass to error messages

Comment thread test/e2e/driver_test.go Outdated
Comment on lines +211 to +237
driverPods, err := clientset.CoreV1().Pods(cfg.Namespace).List(ctx, metav1.ListOptions{
LabelSelector: driverPodSelector,
})
if err != nil {
fmt.Fprintf(GinkgoWriter, "Failed to list driver pods: %v\n", err)
return
}
for _, pod := range driverPods.Items {
for _, c := range pod.Spec.Containers {
stream, err := clientset.CoreV1().Pods(cfg.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
Container: c.Name,
TailLines: &tailLines,
}).Stream(ctx)
if err != nil {
fmt.Fprintf(GinkgoWriter,
"Driver pod %s, container %s: failed to get logs: %v\n",
pod.Name, c.Name, err)
continue
}
buf := new(bytes.Buffer)
_, _ = io.Copy(buf, stream)
stream.Close()
fmt.Fprintf(GinkgoWriter,
"Driver pod %s, container %s (last %d lines):\n%s\n",
pod.Name, c.Name, tailLines, buf.String())
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willie-yao
Copy link
Copy Markdown
Contributor Author

/assign @nojnhuh

Comment thread test/e2e/driver_test.go Outdated
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2026
Comment thread .github/workflows/lint.yaml Outdated
Comment thread test/go.mod Outdated
@nojnhuh nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 13, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the annoying comments on the workspace/lint periphery here. I promise I'm starting to look at the main point of this PR.

Comment thread Makefile Outdated
Comment on lines +82 to +83
golangci-lint run ./...
cd test && golangci-lint run --build-tags=e2e ./...
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hides lint failures in the test module if any issues are found in the main module. Can we combine into one call?

Suggested change
golangci-lint run ./...
cd test && golangci-lint run --build-tags=e2e ./...
golangci-lint run ./...
golangci-lint run --build-tags=e2e ./... ./test/...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried combining it with golangci-lint run --build-tags=e2e ./... ./test/... but it silently doesn't lint the test module. It logs pattern ./test/...: directory prefix test does not contain main module or its selected dependencies and exits 0 with 0 issues. I'm thinking to do it by just creating another make target for test

.PHONY: lint lint-root lint-test
lint:
	@$(MAKE) -k lint-root lint-test

lint-root:
	golangci-lint run ./...

lint-test:
	cd test && golangci-lint run --build-tags=e2e ./...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a workspace set up, it seems to work for me: willie-yao/dra-example-driver@separate-driver-install...nojnhuh:dra-example-driver:workspace-lint

% make lint
golangci-lint run --build-tags=e2e ./... ./test/...
cmd/dra-example-webhook/main.go:183:1: Comment should end in a period (godot)
// function
^
test/e2e/e2e_setup_test.go:54:1: Comment should end in a period (godot)
// driverPodSelector finds kubelet plugin Pods within an installed driver's release namespace
^
2 issues:
* godot: 2
make: *** [Makefile:83: lint] Error 1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that was completely my bad. I had v0.0.0 set before when testing something. Setting it to v0.2.1 fixed it! Thanks for checking this for me

Comment thread .github/workflows/lint.yaml Outdated
Comment thread test/go.mod Outdated
k8s.io/api v0.36.0
k8s.io/apimachinery v0.36.0
k8s.io/client-go v0.36.0
sigs.k8s.io/dra-example-driver v0.0.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go mod tidy seems to want to put v0.2.1 here if this doesn't already exist. That should still ultimately resolve to the local version though and not that actual tagged version.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed locally that with replace => ../ present, v0.2.1 or any other version resolves to the local checkout. Should I change this to v0.2.1?

Comment thread test/go.mod Outdated
Comment thread Makefile Outdated
@willie-yao willie-yao force-pushed the separate-driver-install branch from e5f08a6 to 27b36c3 Compare May 15, 2026 17:17
Comment thread test/e2e/driver_test.go Outdated
Comment on lines +140 to +143
// runUpgradeOrInstall emulates `helm upgrade --install`: if a release with
// the configured name already exists it is upgraded in place, otherwise a
// fresh install is performed.
func runUpgradeOrInstall(ctx context.Context, actionCfg *action.Configuration, install *action.Install, chrt *chart.Chart, values map[string]any) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, will this ever actually perform an upgrade of an existing Helm release? If we expect each test to have a separate instance of the driver, then I think we'd want to error if that driver already exists instead of upgrade and risk affecting a different test.

The script used helm upgrade --install before mostly to be copy-pasteable, but that was always expected to first be run on a fresh kind cluster without the driver installed where a plain helm install should have worked.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right, this actually neds up hiding the errors and was leftover from trying to replicate helm upgrade --install. I'll remove it and use install.RunWithContext directly.

Comment thread test/e2e/driver_test.go Outdated
actionCfg := newHelmActionConfig(cfg.Namespace, registryClient)

install := action.NewInstall(actionCfg)
install.ReleaseName = cfg.ReleaseName
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we automatically generate a release name instead? That will make sure there aren't conflicts between tests, tests don't have to decide for themselves what the driver should be called, and we don't have to validate them up front.

Comment thread test/e2e/driver_test.go Outdated
Expect(validation.IsDNS1123Label(cfg.ReleaseName)).To(BeEmpty(),
"DriverConfig.ReleaseName %q must be a valid DNS-1123 label", cfg.ReleaseName)
if cfg.Namespace == "" {
cfg.Namespace = cfg.ReleaseName
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the release name, can we generate a new, separate namespace for each instance of the driver?

Comment thread test/e2e/driver_test.go Outdated
Comment on lines +61 to +64
// maxDriverNameLen is the longest driver name that fits within Linux's
// 108-byte UNIX_PATH_MAX after the kubelet appends its registrar socket
// prefix and per-pod UID suffix.
maxDriverNameLen = 28
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on how this value was calculated?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH copilot came up with this and I don't think it's necessary for the E2E test since we auto-generate default driver names that won't go over the limit anyways. I'll go ahead and remove it

Comment thread test/e2e/e2e_test.go Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it simplify things if we construct these ResourceClaims inline vs. reading them from a file?

// with the per-test driver name and (when set) extended resource name.
func substituteDriverIdentifiers(raw string, drv DriverConfig) string {
GinkgoHelper()
out := strings.ReplaceAll(raw, defaultDeviceClassName, drv.DriverName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, the manifests will be referencing other drivers in addition to gpu.example.com. Some manifests may even rely on multiple separate instances of the driver running different profiles. Not a blocker for this PR, but we should find a better way to handle dynamic driver names.

@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented May 18, 2026

FYI @willie-yao A couple of dependabot PRs merged which conflict with this PR.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 18, 2026
@willie-yao willie-yao force-pushed the separate-driver-install branch from aa4b2eb to 833008d Compare May 18, 2026 20:05
Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more nit, otherwise LGTM.

Could you also check that all the mod/work/sum files are up to date?

go mod tidy && go -C test mod tidy
go work sync
go mod tidy && go -C test mod tidy # again

Comment thread test/e2e/driver_test.go Outdated
Comment on lines +143 to +146
Expect(cfg.ReleaseName).To(BeEmpty(),
"DriverConfig.ReleaseName is populated by installDriver; callers must not set it")
Expect(cfg.Namespace).To(BeEmpty(),
"DriverConfig.Namespace is populated by installDriver; callers must not set it")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making these available to tweak and then checking to make sure they weren't tweaked isn't ideal. Could we set this in a separate local variable in installDriver and pass that around separately from the DriverConfig?

@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented May 18, 2026

Could you also check that all the mod/work/sum files are up to date?

go mod tidy && go -C test mod tidy
go work sync
go mod tidy && go -C test mod tidy # again

Having written this out, this sounds kind of insane and I'm open to the possibility that separate modules is going to be more pain going forward than it's worth. We could try sticking with one module for now and see if anybody runs into any issues. Splitting things up later will probably be easier than putting them back together.

@willie-yao
Copy link
Copy Markdown
Contributor Author

Having written this out, this sounds kind of insane and I'm open to the possibility that separate modules is going to be more pain going forward than it's worth. We could try sticking with one module for now and see if anybody runs into any issues. Splitting things up later will probably be easier than putting them back together.

Yeah you're right. I noticed the go.work and go.sum files kept going out of date even when doing go mod tidy... I can go ahead and refactor into one module if you think that's a better way to go about things.

@willie-yao willie-yao force-pushed the separate-driver-install branch from 7e93526 to c6421b6 Compare May 18, 2026 21:58
@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented May 18, 2026

Having written this out, this sounds kind of insane and I'm open to the possibility that separate modules is going to be more pain going forward than it's worth. We could try sticking with one module for now and see if anybody runs into any issues. Splitting things up later will probably be easier than putting them back together.

Yeah you're right. I noticed the go.work and go.sum files kept going out of date even when doing go mod tidy... I can go ahead and refactor into one module if you think that's a better way to go about things.

Yeah, I'm thinking let's go back to one module for now. Trying it here was still a worthwhile experiment nonetheless.

Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

/hold in case we want to try to use Helm v4

Comment thread go.mod Outdated
github.com/stretchr/testify v1.11.1
github.com/urfave/cli/v2 v2.27.7
google.golang.org/grpc v1.81.1
helm.sh/helm/v3 v3.21.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a v4 for Helm which I don't think has too many changes from v3. Can we use that here? If it turns out to be more work, let's file an issue to update this since I'm not sure how long v3 will get security updates.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 20, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 4030de44291eece73f23b4bab4d9fda21c399085

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh, willie-yao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 20, 2026
@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented May 20, 2026

@willie-yao Could you also please squash?

Signed-off-by: William Yao <william2000yao@gmail.com>
@willie-yao willie-yao force-pushed the separate-driver-install branch from a78fddd to a85fb42 Compare May 20, 2026 16:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2026
@k8s-ci-robot k8s-ci-robot requested a review from nojnhuh May 20, 2026 16:30
@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented May 20, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 20, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 378fbf454ad4408d9c7d8b34703fe2bd4383b985

@willie-yao
Copy link
Copy Markdown
Contributor Author

Added helm v4
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2026
@k8s-ci-robot k8s-ci-robot merged commit e94a45a into kubernetes-sigs:main May 20, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve e2e test isolation, error reporting, and extensibility

4 participants